Global mapping of RNA-chromatin contacts reveals a proximity-dominated connectivity model for ncRNA-gene interactions

Non-coding RNAs (ncRNAs) are transcribed throughout the genome and provide regulatory inputs to gene expression through their interaction with chromatin. Yet, the genomic targets and functions of most ncRNAs are unknown. Here we use chromatin-associated RNA sequencing (ChAR-seq) to map the global network of ncRNA interactions with chromatin in human embryonic stem cells and the dynamic changes in interactions during differentiation into definitive endoderm. We uncover general principles governing the organization of the RNA-chromatin interactome, demonstrating that nearly all ncRNAs exclusively interact with genes in close three-dimensional proximity to their locus and provide a model predicting the interactome. We uncover RNAs that interact with many loci across the genome and unveil thousands of unannotated RNAs that dynamically interact with chromatin. By relating the dynamics of the interactome to changes in gene expression, we demonstrate that activation or repression of individual genes is unlikely to be controlled by a single ncRNA.

staining.b, Validation of H9 hESCs differentiation by qPCR of key pluripotency state marker genes (OCT4, SOX2, NANOG) and definitive endoderm marker genes (FOXA1, FOXA2, SOX17, CXCR4) in ES and DE cells.qPCR results shown as relative expression levels calculated using the 2 ΔΔCT method with the PBGD housekeeping gene for normalization.Data points represent 3 technical replicates.c, Number of raw reads obtained for each cell type and replicate (left), number of reads left after quality filter for which both the RNA and DNA mapped to the genome (including multimappers, no Q score filtering, middle panel), and final number of chimeric reads with high confidence alignment and annotation (Bowtie2 alignment Q score >=40 on the DNA side, STAR alignment score 255 on the RNA side, single gene annotation, either from Gencode v29 or from the de-novo transcriptome as described Fig. 2).These high quality chimeric reads were used for all the analysis in this paper, except where indicated otherwise.d, Percentage of ncRNA reads originating from specific subtypes of ncRNAs.The ncRNA subtypes are obtained from Gencode v29 and further simplified as indicated in Supplementary Data 7. e, Left group: Breakdown of the chromatin-association score versus expression scatter plot shown in Fig. 1d (ncRNA, right) by subtypes of ncRNAs in ES cells.Right group: same plots as in Fig. 1d but for introns of individual RNAs rather than exons.Data are displayed as in Fig. 1d.f, Replicability of the ChAR-seq maps.ChAR-seq maps for individual replicates in ES and DE cells are shown at two different resolutions as indicated.All the maps are shown as introns and exons of individual genes summed together.Color indicates contact per genomic kb per million reads (CPKM).g, Cross-correlation of the 100kb resolution maps across cell lines and replicates, and separately for RNAs originating from exons, introns, and UTL.The cross-correlation for a pair of maps is computed as the Pearson correlation of the pixel intensities in the maps (i.e., the contact rates between an RNA and a 100kb genomic interval).Maps with all chromosomes and all RNAs expressed above 0.1 FPM in the caRNA transcriptome are used to compute the correlations.h, Distance-dependent interaction curve, showing the likelihood for an mRNA exon to contact a genomic site as a function of the RNA travel distance, defined as the distance between the RNA transcription locus (mapping coordinate of the RNA-derived side of the corresponding ChAR-seq read) and the target DNA locus (mapping coordinate of the DNA-derived side of the ChAR-seq read)  For the bridge finding step, "None" indicates no bridge sequence was found in the raw read, and "Multiple" indicates several occurrences of the bridge sequence were found in the raw read.Only reads with a single occurrence ("Single") of the bridge sequence were kept.For the length filtering step, short RNA (resp.short DNA) indicates that the RNA (resp.DNA) side of the read was shorter than 15bp.c, Details on the alignment step, showing the percentage of reads that were discarded during the alignment of either the RNA or DNA side, due to too many mapped locations or no mapping location (as reported by STAR and Bowtie2, respectively).Percentages shown are relative to the number of reads left after the rRNA decontamination step.Reads that passed the alignment filtering step on both the RNA and DNA side were selected for the pairing stage.Reads that passed only the RNA or only the DNA alignment filter or none of them were discarded at the pairing stage.The RNA annotation bar plot shows the percentage of reads relative to the number of reads passing the RNA alignment stage as described above, which overlapped exclusively to exons, or to introns (including exons-introns junctions) defined in Gencode V29.Intergenic indicates reads that were not contained within a gene body or were antisense to a gene.

2 E S r e p l ic a t e 1 D E r e p l ic a t e 1 D E r e p l ic a t e 2 ESupplementary Figure 1 .
side (X-axis): 100kb resolution tiling of Chr7 & Chr8 : top 200 RNAs (by expression) on Chr 7 & 8 RNA side (Y-axis) side (X-axis) : 10kb resolution tiling of Chr7 80Mb-127Mb (Zoom In) : Top 50 RNAs (by expression) with locus in region Chr7 80Mb-127Mb RNA side (Y-axis) Validation of H9 hESCs differentiation into definitive endoderm, ChAR-seq reads statistics, chromatin enrichment details, and replicability of the ChAR-seq maps, related to Fig. 1. a, Validation of H9 hESCs differentiation into definitive endoderm by widefield immunofluorescence.H9 cells are stained against Sox2, Sox17, FoxA2 and Nanog, pre-and post-differentiation. Bar plots show quantification of Sox2 and Sox17

Supplementary Figure 3 .
Reprocess using gene models from de-novo annotated transcriptsRed arrows indicates RNA & DNA side processed together as a pair (to maintain readID match line by line on RNA and DNA) Dotted orange arrow indicates add-on to pipeline to annotate reads that don't overlap with Gencode Genes after de-novo transcriptome assembly (STAR methods) RNA Side fastQ Supplementary Figure2.Diagram of the ChAR-seq preprocessing computational pipeline, showing the processing steps and associated tools.The pipeline takes the fastq files from the sequenced ChAR-seq libraries as input, and outputs "pairs" files which essentially contain the RNA and DNA coordinates of each contact, along with an annotation of the genes from which the RNA originates.ChAR-seq libraries QC: reads statistics at various stages of the preprocessing pipeline.a, Loss of reads not meeting quality filters during the preprocessing pipeline.Left panel shows the percentage of reads left after each stage of the processing pipeline, relative to the initial number of raw reads.Right panel shows the fraction of reads lost at each stage of the pipeline, relative to the number of reads left after the prior stage.b, Details on the bridge finding and length filtering, showing the breakdown of the reasons why reads were filtered out at this stage.

Supplementary Figure 4 .
smFISH validation of the localization patterns of select RNAs, related to Fig. 1. a, Representative localization patterns of GAPDH, XACT, MALAT1, and RMRP in individual cells, as measured by smiFISH(Tsanov et al. 2016).b, Chromatin localization profiles determined by ChAR-seq in ES cells of the RNAs shown in a.

1 D E r e p l ic a t e 1 D E r e p l ic a t e 2 ESupplementary Figure 6 .
e p l ic a t e S r e p l ic a t RNA-DNA contacts of UTLs are more dynamic during differentiation than those of annotated RNAs, related to Fig. 3. a, Number of RNA-DNA interactions tested for differential representation in ES versus DE cells at 1 Mb DNA resolution (left) and 100 kb DNA resolution (right), by class of RNA.b, Quantification by RNA class of the percentage of interactions upregulated in DE or ES cells amongst all interactions tested in that class (interactions with >10 counts in at least one replicate in ES or DE), at 1 Mb DNA resolution.Same analysis as in Fig. 3b, but at 1 Mb resolution on the DNA side rather than 100 kb resolution.c, Cross-correlation of the RNA-DNA contacts maps of UTLs at 100kb resolution.Correlations are computed as in Supplementary Fig. 1f.d, Percentage of differential interaction not explained by differential RNA expression, at 100 kb resolution, relative to the total number of interactions tested within the RNA class.

Supplementary Figure 7 .
Calibration of trans-delocalization scores using a Generalized Linear Model a, Uncalibrated trans-delocalization score for individual mRNAs exons (top) or introns (bottom) as a function of their expression and chromosome of origin (chr1, chr10, and chr11 shown as subpanels), and GLM fit (red line).GLM is defined in equation S3 b, Distribution of the raw trans-delocalization scores of mRNAs for the two ChARseq replicates in ES cells as defined in equation S1, and of the calibrated trans-delocalization scores as defined in S5.The raw trans-delocalization scores show sample biases which are regressed out after calibration.f ES trans-delocalization score DE trans-delocalization score DE cis-delocalization score ES cis-delocalization score Ultra localized ES or DE only Delocalized ES & DE Delocalized ES or DE only Ultra localized ES & DE trans-delocalization score (lncRNAs exons) cis-delocization score (lncRNAs exons)

Supplementary Figure 8 .Supplementary Figure 9 .
Additional data on delocalization scores and broadly localized RNAs, related to Fig. 4. a, Distribution of trans-(left panel) and cis-(right panel) delocalization scores for introns of mRNAs, lncRNAs, and ncRNAs.Bars below each distribution represent median and interquartile range.b, Distribution of the difference in trans-delocalization score between the exons and introns of the same RNA and as function of the type of RNA (mRNA or lncRNA) and cell type.Percentages indicate the percentage of RNAs for which the exonic trans-delocalization score is larger than the intronic trans-delocalization score.c, Absolute number of RNAs in each category classified as delocalized (either cis-or trans-localized at FDR 0.05).d, Heat maps showing the cis and trans delocalization scores and chromatin abundance in ES and DE cells for the 20 most abundant lncRNAs (excluding those identified as cis-or trans-delocalized).e, Metagene plots showing the levels of several RNA categories (mRNAs, snRNAs) and individual RNAs (7SK, MALAT1, VTRNA1-1, RMRP) near genomic loci with PolII ChIP-seq peaks.PolII peaks were obtained from GSE105028(Lyu et al. 2018).For each panel, the top plot shows the pileup of the specific RNA or RNA group (blue line), and the pileup for the background RNAs, defined as the total signal of all mRNAs localized on trans-chromosomes, which captures DpnII site density bias, accessibility bias, and other forms of non-specific localization bias (Methods and Supplementary Note 4).Each signal is displayed as a contact density (CPKM = contacts per 1k genomic bp per 1 million reads), normalized by the median metagene background contact density in 10kb bins in a 1Mb region around the feature, (so that the background decays to 1.0 far from the feature center).The bottom plot shows the RNA fold enrichment over background.f, Scatterplots showing the trans-(left) and cis-delocalization scores (right) for individual lncRNAs in DE versus ES cells.Black lines show linear regression output.Expression-distance model predicts >99% of RNA-chromatin contacts, related to Fig. 5. a, Cross-correlation of the 100 kb resolution RNA-DNA contact maps obtained from the true ChAR-seq data and predicted by the generative model.Cross-correlation matrices are shown separately for the RNAs from introns, exons, and for UTLs.The cross-correlation for a pair of maps is computed as in Supplementary Fig. 7b.b, Number of interactions tested for enrichment over model (left), number of interactions identified as over-represented in the observed data compared to the model (significant interactions, middle), and proportion of significant interactions in relation to the total number of tested interactions in each RNA category (right).For this analysis, interactions are defined at 100 kb resolution on the DNA side.Cis (resp.trans) indicate interactions where the RNA and DNA are on the same (resp.on a different) chromosome.c, Same quantification as in b, but broken up by RNA type.d, Scatter plot showing for each lncRNA-gene contact its observed log2 fold change during ES to DE cells differentiation (LFC) by ChAR-seq, and its predicted LFC by the generative model.Contacts are defined as in Fig. 7a-b.Color indicates the LFC of the lncRNA level on chromatin (shrunken estimate computed using DESeq2 applied to the RNA-side of the ChARseq data, Supplementary Data 1).e, Distribution of the RNA-DNA travel distance for interactions significantly above the model.Related to Fig. 5f but broken up by type of RNA.

. Detailed composition of the unannotated transcribed loci (UTLs), related to Fig. 2. a,
Percent of UTLs reads originating from each subtype of repeat (left, relative to reads annotated as repeat-derived), and each subtype of Cis-regulatory elements (right, relative to reads annotated as CRE-derived).Cis-regulatory elements are classified using the 7-group classification from the Encode Registry of Regulatory Elements(ENCODE Project Consortium et al. 2020) (pELS=proximal Enhancer Like Sequence, dELS=distal Enhancer Like Sequence, PLS = Promoter Like Sequence, see STAR Methods).b, Diversity of the UTLs.Bar plots show the absolute number of RNAs expressed at FPM above 0.1 for each Gencode type of exons and introns, and for each type of UTL.The FPM value refers to the maximum of the ES and DE FPM values."In transcriptome" values refer to the total number of RNA of each type in either GencodeV29 or in the catalog of UTLs generated in this study (Supplementary Data3) c, Distribution of the expression level of the UTLs compared to exons of annotated mRNAs, lncRNAs and ncRNAs.d, Subcellular localization, determined by single molecule RNA-FISH, in ES and DE cells of two cell-state specific UTLs: UTL 86978 (left panels) and UTL 61578 (right panels).DNA staining is shown in the first column, RNA-FISH signal in the second column, and a merge in the third column.ES cells are in the top row and DE cells in the bottom row.Scale bar = 10μm.